The limited role of non-native contacts in folding pathways of a lattice protein 
Models of protein energetics which neglect interactions between amino acids that are not
adjacent in the native state, such as the Go model, encode or underlie many influential ideas on 
protein folding. Implicit in this simplification is a crucial assumption that has never been critically 
evaluated in a broad context: Detailed mechanisms of protein folding are not biased by non-native 
contacts, typically imagined as a consequence of sequence design and/or topology. Here we present, 
using computer simulations of a well-studied lattice heteropolymer model, the first systematic test 
of this oft-assumed correspondence over the statistically significant range of hundreds of thousands 
of amino acid sequences, and a concomitantly diverse set of folding pathways. Enabled by a 
novel means of fingerprinting folding trajectories, our study reveals a profound insensitivity of the 
order in which native contacts accumulate to the omission of non-native interactions. Contrary 
to conventional thinking, this robustness does not arise from topological restrictions and does not 
depend on folding rate. We find instead that the crucial factor in discriminating among topological 
pathways is the heterogeneity of native contact energies. Our results challenge conventional 
thinking on the relationship between sequence design and free energy landscapes for protein folding, 
and help justify the widespread use of Go-like models to scrutinize detailed folding mechanisms of 
real proteins. 

I. INTRODUCTION

Current understanding of protein folding has been 
strongly shaped by theoretical and computational studies
of simplified models [1]. Such models are typically constructed
by discarding fine details of molecular structure
or by making simplifying assumptions about the energies
of interaction among amino acid residues. A special
class of models, based on Go's insights, asserts that
only a subset of interactions, those between segments of a 
protein that contact one another in the native state, are 
crucially important for folding. The Go model further 
assumes a unique energy scale for these native contacts. 
Here, we will focus on elaborated "Go-like" models that 
allow for a diversity of native contact energies. 

Neglect of non-native contacts offers substantial computational
relief to numerical simulations, allowing thorough
kinetic and thermodynamic studies to be performed
even for detailed molecular representations.
It further establishes a basis for theories that focus
on gaps in the spectrum of conformational energies [8]
and the funnel-like nature of potential energy
landscapes [9-13]. Corroborated by experiment, concepts
intrinsic to and inspired by Go-like models now
form a canon of widely accepted ideas about how proteins
fold.

The Go model was originally proposed as a schematic
but microscopic perspective on the stability and ki- 
netic accessibility of proteins' native states. It accord- 
ingly provided generic insight into issues of cooperativity, 
nucleation, and the relationship between sequence and 
structure. Recent studies have ascribed a much more
literal significance to the detailed dynamical pathways
defined by Go-like models. In particular, direct comparisons
have been drawn between folding mechanisms predicted
by Go-like models for specific proteins and those
suggested by experimental results [17,18]. However, it is
not clear to what extent such a detailed correspondence 
with Go-like models should be expected. General the- 
ories offer only rough guidance, and few computational 
studies have compared folding pathways of Go-like models
and their "full" counterparts (in which non-native
contact energies are included) in a broad context.

Very favorable interactions between segments of a protein
that are not adjacent in the folded state generally impede
folding. They might do so by introducing detours or
traps on the route to the native state, or simply by stabilizing
the ensemble of unfolded conformations [21,22].
It is often imagined that the former possibility plagues a
vast majority of non-natural amino acid sequences, which
fold sluggishly if at all [23,24]. According to this picture,
non-native contacts should feature prominently in the 
convoluted folding pathways of an undesigned sequence. 
Such kinetic frustration could pose several biological risks 
in vivo, where aggregation and slow response can be seri- 
ous liabilities. Indeed, typical proteins taken from living 
organisms fold reliably and with relative efficiency.

These notions and observations motivate a "principle
of minimum frustration" asserting that natural amino 
acid sequences have been "designed" by evolution to min- 
imize the disruptive influence of non-native contacts on 
the dynamics of folding [9]. One might thus apply Go-like
models to these designed sequences with confidence, since 
the omitted interactions are precisely the ones whose effects
have been mitigated by natural selection. By contrast,
one might expect Go-like models to poorly represent
folding mechanisms of slowly folding molecules,
whose non-native interactions are presumably responsible
for hampering pathways to the native state [21].

Testing these ideas of sequence design and kinetic frus- 
tration is made difficult by several factors. Experimen- 
tally, microscopic details of folding kinetics cannot be re- 
solved but only inferred from indirect observables or the 
effects of mutations. Furthermore, the most concrete hy- 
potheses stemming from the principle of minimum frus- 
tration involve Go-like models, which cannot be real- 
ized in the laboratory. Computer simulations of detailed 
molecular representations can generate, at great cost, 
dynamical information sufficient to determine a folding 
mechanism for only the smallest of natural proteins.
Although the statistical dynamics of coarse-grained or 
schematic representations can be readily explored, biol- 
ogy does not provide collections of fast-folding and slow- 
folding sequences to compare in these artificial contexts. 
Finally, even when appropriate ensembles of sequences 
and ensembles of folding trajectories are available, useful
comparison of Go-like models and their full counterparts
requires a compact way of characterizing the course of
highly chaotic dynamics. A general method for this
purpose is not available, though studies of nucleation
as a rate-limiting fluctuation provide a useful starting
point.

This paper presents the first systematic, large-scale 
comparison of folding pathways within Go-like and full 
models. We focus on a schematic lattice representation 
of proteins, well-suited for this task in several ways: (a) 
geometrically, because contacting segments of the chain 
can be unambiguously identified, (b) statistically, be- 
cause representative ensembles of folding trajectories can 
be generated for large numbers of amino acid sequences, 
and (c) conceptually, because the essential competition 
between contact energetics and chain connectivity can be 
isolated from complicating effects of secondary structure, 
side-chain packing, etc. While these latter effects unques- 
tionably bear in important ways on the folding of real 
proteins, it is nevertheless imperative to understand the 
fundamental physical scenarios they enrich and modify. 
Indeed, much of biologists' working intuition for protein 
folding and design was developed in the context of sim- 
ilarly schematic models. Our results challenge some of 
those notions. 

It has been conjectured that well-designed lattice heteropolymers
fold through mechanisms that are determined
solely by their native structures. Were this
hypothesis correct, for both full and Go-like models, a
comparison of fast-folding pathways in the two models
would not be especially informative. In that case the 
sequence of events that advance a molecule toward the 
native state (which we designate as its folding mecha- 
nism) would be exclusively a question of geometry and 
local mobility. We have found, to the contrary, that a 
wealth of folding mechanisms are possible even for a sin- 
gle native conformation. 

Spanning a range of hundreds of thousands of se- 
quences, with widely varying rates and mechanisms, the 
work reported in this paper constitutes a thorough test 
of certain aspects of the principle of minimum frustration
and addresses at a new level of kinetic detail the dynam- 
ical realism that can be expected from Go-like models. 
Our results for the lattice heteropolymer model evidence 
a remarkably strong mechanistic correspondence between 
full and Go-like models. Unexpectedly, this dynamical 
conformity holds not only for fast-folding sequences but 
also for the slowest sequences whose folding can be followed
in practice. Close correspondence in folding mechanisms
holds as long as the Go-like approximation retains
the heterogeneity in native contact energies of the full 
potential. These findings suggest a profound frustration 
invariance in the ensemble of trajectories that proceed 
from deep within the unfolded state all the way to the 
native structure. 



II. METHODS 

We focus on lattice heteropolymers, whose folding
properties have been studied extensively for specific example
sequences, structures, and chain lengths. Here,
a protein's conformation is described by a self-avoiding
walk on a three-dimensional lattice with spacing $a$ (see,
for example, Fig. 1a). Each vertex of this walk represents
an amino acid monomer, which possesses no internal 
structure and interacts only with "contacting" monomers 
that occupy adjacent vertices. For a chain comprising N 
monomers the energy of a particular configuration can 
thus be written 

$$E = \sum_{i=1}^{N-1} \sum_{j>i}^{N} u_{\mathrm{core}}(r_{ij}) + \sum_{i=1}^{N-3} \sum_{j=i+3}^{N} B_{ij}\, \Delta(r_{ij} - a), \qquad (1)$$

where $r_{ij} = |\mathbf{r}_i - \mathbf{r}_j|$. The hard-core potential $u_{\mathrm{core}}(r)$,
which takes on values of $\infty$ for $r = 0$ and $0$ for $r > 0$,
imposes the constraint of self-avoidance. Interaction
energies $B_{ij}$ are determined by the sequence-dependent
identities of monomers $i$ and $j$ according to the model of
Miyazawa and Jernigan (MJ), and act only at a spatial
separation of one lattice spacing [$\Delta(x) = 1$ if $x = 0$ and
vanishes otherwise].
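
To make the energy function of Eq. (1) concrete, the following Python sketch evaluates it for a conformation on a cubic lattice. It is an illustration under stated assumptions, not the code used in this work: coordinates are integers in units of the lattice spacing, monomers are indexed from zero, and the contact energies are passed in as a hypothetical dictionary `B` keyed by monomer pairs (the MJ values themselves are not reproduced here).

```python
def lattice_energy(coords, B):
    """Energy of a lattice chain in the spirit of Eq. (1).

    coords: list of integer (x, y, z) tuples, one per monomer,
            in units of the lattice spacing (an assumption of this sketch).
    B:      dict mapping (i, j) with i < j to a contact energy
            (hypothetical input; pairs not listed contribute zero).
    """
    n = len(coords)
    # Hard-core term: two monomers on the same site cost infinite energy.
    if len(set(coords)) < n:
        return float("inf")
    energy = 0.0
    for i in range(n):
        # The contact sum starts at j = i + 3, as in Eq. (1).
        for j in range(i + 3, n):
            # For integer coordinates, unit Manhattan distance is the
            # same as a separation of exactly one lattice spacing.
            dist = sum(abs(p - q) for p, q in zip(coords[i], coords[j]))
            if dist == 1:
                energy += B.get((i, j), 0.0)
    return energy
```

For a four-monomer chain folded into a square, only the pair (0, 3) is in contact, so the energy reduces to the single entry of `B` for that pair.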

FIG. 1: (a) 48-mer native structure of the lattice heteropolymer studied in this work. (b) Example of histograms of the order
of permanent formation of native contacts (contact appearance order, or CAO) for each of the nine native contacts of a 12-mer
lattice structure. Histograms are collected from the set of folding trajectories of a given amino acid sequence. (c) Same as Fig.
1b but shown as a density map. (Right, Top) CAO maps of three fast-folding sequences of the 48-mer (Fig. 1a), for both the
full potential energy and the Go-like approximation (which disregards non-native contact energies, but maintains the original
heterogeneity in native contact energies). The overlap parameter $q$ quantifies the similarity of CAO maps, and thus of topological
folding pathways. The overlap of CAO maps between full and Go-like potentials for each sequence is close to one, $q \sim 0.9$,
indicating the similarity of their folding mechanisms. In contrast, the overlap between CAO maps of different sequences is
much smaller, $q \lesssim 0.2$. (Right, Bottom) Same as before but now for three slow-folding sequences. Again, the CAO distributions
of full and Go-like potentials are very similar, while those of different sequences are not.

The standard dynamical rules for evolving such a chain
molecule proceed from a Metropolis Monte Carlo algorithm.
Trial moves, in which one or two randomly selected
monomers move in an "edge-flip" or "crankshaft"
fashion, are accepted with probabilities that generate a
Boltzmann distribution at temperature $T = 0.16\,\epsilon_0$,
where $\epsilon_0$ sets the energy scale of the MJ model. For example,
the strongest attractive interaction (between two
cysteines) has an energy $\epsilon_{CC} = -1.06\,\epsilon_0$; for lysine-lysine,
$\epsilon_{KK} = 0.25\,\epsilon_0$. Folding trajectories are initiated from
swollen configurations drawn from a high-temperature
($k_B T'/\epsilon_0 = 100$) equilibrium distribution in which contact
energies are negligible compared to typical thermal
excitations.
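
The acceptance step of such a Metropolis scheme can be sketched as follows. This is a generic illustration, not the simulation code of this work: the function name is hypothetical, energies and temperature are assumed to be expressed in the same units (with $k_B = 1$), and the edge-flip and crankshaft moves that would generate the trial conformation are not implemented.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng=random):
    """Decide whether to accept a trial move with energy change delta_e.

    Moves that lower (or keep) the energy are always accepted; uphill
    moves are accepted with probability exp(-delta_e / temperature).
    A move that violates self-avoidance has infinite delta_e and is
    always rejected.
    """
    if delta_e <= 0:
        return True
    if math.isinf(delta_e):
        return False
    return rng.random() < math.exp(-delta_e / temperature)
```

With `temperature = 0.16`, breaking a single contact of energy $-1.06$ (so that `delta_e = 1.06`) is accepted with probability of roughly $e^{-6.6}$, which illustrates how strongly the strongest contacts are held at the simulation temperature.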

This caricature clearly lacks many of the chemical details
underlying the function and secondary structure of
real proteins. By capturing an essential interplay between
diverse local interactions and constraints of polymer
connectivity, it nonetheless recapitulates many nontrivial
features of protein statistical mechanics: Even for
chains of modest length (say, $N = 27$), the number of
possible conformations is sufficiently immense to motivate
Levinthal's paradox, i.e., it is not obvious that they
should be able to fold at all. Folding occurs in a cooperative
fashion, and occurs efficiently only for well-designed
sequences. For a given sequence certain residues figure 
much more prominently in folding kinetics than others; 
correspondingly, certain residues are more highly con- 
served than others in computer simulations of evolution- 
ary dynamics. 



The Go-like approximation of the model of Eq. (1) is
constructed simply by ignoring the energies of non-native
contacts,

$$E = \sum_{i=1}^{N-1} \sum_{j>i}^{N} u_{\mathrm{core}}(r_{ij}) + \sum_{i=1}^{N-3} \sum_{j=i+3}^{N} \mathcal{N}_{ij} B_{ij}\, \Delta(r_{ij} - a), \qquad (2)$$

where $\mathcal{N}_{ij} = 1$ if the monomers $i$ and $j$ are adjacent in
the native configuration, and $\mathcal{N}_{ij} = 0$ otherwise. While
disregarding the energy contribution of non-native contacts,
the energy function $E$ of Eq. (2) retains the full
heterogeneity in native contact energies of the original
potential, Eq. (1). We will show below that this heterogeneity
is a crucial aspect of the Go-like models we study here.
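
The only difference between the full and Go-like energies is the native-contact mask. A minimal Python sketch of that masking is given below; it assumes integer lattice coordinates in units of the lattice spacing, zero-based monomer indices, and a hypothetical dictionary `B` of contact energies, none of which are taken from the original simulation code.

```python
def contacts(coords):
    """Pairs (i, j) with j >= i + 3 that sit on adjacent lattice sites."""
    found = set()
    for i in range(len(coords)):
        for j in range(i + 3, len(coords)):
            if sum(abs(p - q) for p, q in zip(coords[i], coords[j])) == 1:
                found.add((i, j))
    return found


def go_like_energy(coords, B, native):
    """Go-like energy: contact energies are summed only over pairs that
    are in contact now AND in the native structure. The values B[pair]
    are kept heterogeneous rather than set to a single energy scale."""
    if len(set(coords)) < len(coords):
        return float("inf")  # hard-core overlap
    return sum(B[pair] for pair in contacts(coords) & native)
```

Here `native` would be obtained once as `contacts(native_coords)`. A conformation whose only contacts are non-native has zero Go-like energy, whatever energies `B` assigns to those pairs.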

Many studies previously suggested that lattice heteropolymers
of modest length fold via a nucleation
mechanism [28,29]. Formation of a handful of key contacts
poises the system at a transition state, from which the 
chain can rapidly access the folded state or, with equal 
probability, return to the unfolded state. This set of cru- 
cial contacts comprises a "folding nucleus" and serves as 
a bare synopsis of dynamical pathways that lead to the 
native state. 

A cogent comparison of folding mechanisms requires a 
means of characterizing dynamical pathways that is both 
thorough and computationally inexpensive. Identifying 
the folding nucleus satisfies neither of these necessities
well. In particular, locating configurations from which 
the folded and unfolded states are equally accessible in- 
volves propagation of many trajectories and, by itself, 
does not delineate routes toward and away from the transition
state. We have devised an alternative measure
that is not only succinct and computationally tractable, 
but also characterizes the entire route from the unfolded 
to the folded state. Specifically, we record the order in 
which native contacts form permanently during a protein's
folding mechanism. Our parameters thus chronicle
lasting changes in the chain's "topology", understood
in terms of linkages through the polymer backbone and 
through non-bonded contacts. 
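
The bookkeeping behind this measure is simple, as the following Python sketch suggests. It assumes that a folding trajectory has been reduced to a list of frames, each frame being the set of native contacts present at that time, and that the last frame is the native state. The function name and the tie-breaking rule (contacts that lock in at the same frame are ordered by label) are assumptions of this sketch, not details taken from the original analysis.

```python
def contact_appearance_order(frames, native):
    """Rank native contacts by the time at which they form permanently.

    frames: list of sets of native contacts present at successive times;
            the final frame is assumed to be the native state.
    native: set of all native contacts.
    Returns a dict mapping contact -> order C (1 = first to lock in).
    """
    formed_at = {}
    for contact in native:
        # A contact is permanent from the frame after its last absence.
        absent = [t for t, frame in enumerate(frames) if contact not in frame]
        formed_at[contact] = absent[-1] + 1 if absent else 0
    # Tie-breaking by contact label is an assumption of this sketch.
    ranked = sorted(native, key=lambda c: (formed_at[c], c))
    return {contact: rank for rank, contact in enumerate(ranked, start=1)}
```

Note that a contact which appears early but later breaks is ranked by its final, lasting formation, not by its first appearance.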

This contact appearance order (CAO) is a highly non- 
trivial measure of the progress toward folding and pro- 
vides a detailed characterization of mechanism in the 
sense we have defined. It is simple to calculate from 
the time-dependence of a trajectory spanning unfolded 
and folded states. Like persistence times in the context
of nonequilibrium systems, such as glasses, it is intrinsically
a multi-time quantity; it can neither be computed
for a single configuration, nor can it be used to
build constrained ensembles whose statistics shed light
on the nature of reaction coordinates. But, also like persistence
times, it focuses attention on key dynamical
events with unmatched precision. For our purpose of di- 
agnosing the occurrence of lasting topological changes, 
CAOs serve almost ideally. For some other approaches, 
e.g., surveying the free energy landscapes on which fold- 
ing takes place, CAOs would serve poorly. 

We have verified that the mechanistic meaning we ascribe
to CAOs is consistent with more conventional characterizations
of reaction progress. Most importantly, the
order of a contact's appearance correlates strongly with a
statistical measure of commitment to folding at the time
when that contact forms permanently. We use the parameter
$p_{\mathrm{fold}}$, the probability that a trajectory initiated
from a given configuration will reach the folded state before
first relaxing to a state with few native contacts,
to demonstrate this fact. Fig. 2c shows that the average
value of $p_{\mathrm{fold}}$ rises steadily with CAO, from a value well
below $p_{\mathrm{fold}} = 1/2$ up to $p_{\mathrm{fold}} = 1$.
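
Operationally, $p_{\mathrm{fold}}$ is estimated by launching many independent trajectories from the configuration of interest and counting how many reach the folded state first. The Python sketch below shows that counting logic on a toy system; the one-dimensional random walk in the number of native contacts, and all function names, are hypothetical stand-ins for the lattice dynamics and are used only so that the example is self-contained.

```python
import random


def estimate_pfold(start, step, is_folded, is_unfolded, n_trials, rng):
    """Fraction of trial trajectories from `start` that reach the folded
    state before reaching the unfolded state."""
    folded = 0
    for _ in range(n_trials):
        state = start
        while not (is_folded(state) or is_unfolded(state)):
            state = step(state, rng)
        folded += is_folded(state)
    return folded / n_trials


def toy_step(q, rng):
    """Toy dynamics (an assumption of this sketch): the number of native
    contacts q performs an unbiased random walk."""
    return q + rng.choice((-1, 1))
```

For the unbiased toy walk between an unfolded boundary at `q = 0` and a folded boundary at `q = 10`, the exact committor from `q = 5` is 1/2, so the estimate should approach that value as `n_trials` grows.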

The point at which $p_{\mathrm{fold}}$ crosses 1/2 is often considered
the transition state for folding. The set of contacts consistently
present in such configurations is correspondingly
designated as the folding nucleus. We have confirmed
that the nucleus identified in this way corresponds closely
with the set of contacts that have formed permanently
when $p_{\mathrm{fold}} = 1/2$. Additionally, we have verified that the
CAO-identified nuclei of several sequences from Mirny
et al. are consistent with the nuclei identified in that
study. While this consistency check reflects favorably on
the soundness of exploring folding mechanisms by scru- 
tinizing CAOs, it does not imply that CAO analysis is 
predicated on putative nucleation mechanisms for fold- 
ing. Regardless of whether the rate-determining steps 
in folding are uphill, downhill, or neutral in free energy; 
regardless of whether folding is kinetically a two-state 
phenomenon; regardless of whether the progress of fold- 
ing is plagued by long-lived kinetic traps, CAOs trace 
a history of conformational change that emphasizes any 
event with enduring topological consequences. 

What CAOs do not resolve is the unproductive devel- 
opment of native structure. Attention is focused solely 
on segments of time evolution that bridge folded and un- 
folded basins of attraction. Occasional excursions within 
the unfolded state amass an atypically large number of 
native contacts, but due either to topology or to the 
presence of interfering non-native contacts do not in fact 
make progress toward folding. CAOs contain no informa- 
tion about these excursions. In comparing full and Go- 
like models, we therefore make no statements about the 
character of such non-folding dynamics. By exclusively 
examining accumulation of native contacts, we also lose 
direct information regarding the evolution of non-native 
contacts. If the rupture of a particular non-native con- 
tact were a crucial step in folding of a certain sequence, 
our methods would not detect its occurrence explicitly. 
We stress, however, that substantial non-native structure 
is present when the first permanent native contacts are 
formed. We could therefore indirectly detect the signifi- 
cance of non-native contact dynamics through influences 
on the pattern of early topological changes. 

FIG. 2: (a) Distribution of CAO overlaps, $P(q)$, between different sequences, and between full and Go-like potentials, for 1000
sequences chosen randomly out of $10^5$ sequences that fold to the 48-mer structure of Fig. 1a. The sequences in this distribution
were generated by a single high-$T_{\mathrm{ev}}$ evolutionary trajectory (see Appendix). The inset shows that the similarity between full
and Go-like pathways for each sequence is independent of folding rate. Data for this inset were generated from 2000 sequences
chosen randomly from 5 independent evolutionary runs ($5 \times 10^5$ total sequences), all folding to the native 48-mer structure of
Fig. 1a. (b) Distribution of the root-mean-squared fluctuations of contact order, $\delta C$, over the set of Go-like sequences. CAOs
in heterogeneous Go-like potentials vary less from one folding trajectory to another than in the homogeneous Go model. It is
the heterogeneity in native contact energies that selects specific folding pathways; this selectivity is absent in a homogeneous
Go potential. The inset shows the CAO map of the homogeneous Go potential, cf. Fig. 1. (c) Average $p_{\mathrm{fold}}$ as a function of
the number of permanent native contacts formed, for the full and Go-like potentials, for a fast- and a slow-folding sequence. In all
cases $p_{\mathrm{fold}}$ is close to zero until the first permanent contacts are made, confirming that our CAO analysis captures the relevant
dynamical folding regime. $p_{\mathrm{fold}}$ is the probability for a given conformation to reach the folded state before unfolding. For a
given folding trajectory, we calculate $p_{\mathrm{fold}}$ according to the method of Faisca et al., by running independent trajectories from
configurations chosen at evenly spaced time intervals. We regard a molecule as unfolded when the instantaneous number of
native contacts drops to a value consistent with the average number of native contacts in the unfolded state. Additionally, we
require that this threshold lie below any value found in equilibrium fluctuations of the native state.

Compiling the order of permanent contact formation
over many folding trajectories of a given sequence, we
construct for each native contact a statistical distribution
of CAO. Figs. 1b,c illustrate how the set of resulting CAO
histograms forms a visual fingerprint of a sequence's folding
mechanism. Because the dynamical events it chronicles
span a wide range of $p_{\mathrm{fold}}$, a CAO histogram characterizes
not only the transition state for folding, but also
the dynamics of ascent to and descent from the transition
state. The correspondence between an amino acid
sequence and its CAO histogram is as subtle as (if not
more so than) the connection between sequence and native
conformation that defines some of the most challenging
aspects of the protein folding problem. Most of the results
we will present concern a single native structure
(that shown in Fig. 1a for $N = 48$), removing a potentially
trivial agreement between full and Go-like models.
Even for this unique structure, sequences of the full
model differing by only a few point mutations can exhibit
qualitatively different CAO histograms, reflecting
substantial changes in folding pathway. The distribution
of contact energies can thus play a critical and complex
role in determining folding mechanism, over and above
dictating its endpoint. Given this nontrivial relationship
it would be surprising if non-native contacts did not generally
act to shape or bias CAO statistics.
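
Compiling CAO histograms from a set of trajectories amounts to counting, for each native contact, how often it locks in at each position in the order. The Python sketch below illustrates this under the assumption that each trajectory has already been reduced to a dictionary mapping contact to order $C$; the function name and data layout are hypothetical.

```python
from collections import Counter


def cao_distributions(orders, n_max):
    """Build the CAO histogram of each native contact.

    orders: list of dicts (one per folding trajectory), each mapping
            a native contact to its appearance order C in 1..n_max.
    n_max:  total number of native contacts.
    Returns a dict contact -> list P, where P[C - 1] is the fraction of
    trajectories in which the contact formed permanently at order C.
    """
    counts = {}
    for order in orders:
        for contact, c in order.items():
            counts.setdefault(contact, Counter())[c] += 1
    n_traj = len(orders)
    return {
        contact: [cnt[c] / n_traj for c in range(1, n_max + 1)]
        for contact, cnt in counts.items()
    }
```

Each returned list is normalized over the trajectories supplied, so a sharply peaked list corresponds to a contact whose place in the folding order is nearly the same in every trajectory.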

The primary goal of this paper is to compare the CAO 
statistics of sequences propagated using full and Go-like 
models. In judging their similarities and differences, it is 
essential to establish for reference how significantly CAO 
histograms can vary, within either model, for sequences
that fold to a common structure. As mentioned above,
others have proposed that such variations are weak, i.e.,
that topology of the folded structure prescribes a nearly 
unique topological route for folding. Using methods de- 
scribed in the Appendix, we have generated an unprece- 
dentedly diverse set of sequences that fold to the same 
target structure within the full model. As shown in Fig.
1, variations in CAO statistics within this set are much
more substantial than previously thought. Any success 
of Go-like models in reproducing folding pathways of the 
full model cannot be attributed simply to their sharing 
a common native structure. 

We quantify similarity of CAO statistics (for two se- 
quences within the same model, or for full and Go-like 
models with the same sequence) using an "overlap" parameter
$q$. Inspired by the theory of spin glasses, we
define $q$ such that $0 < q < 1$, with larger $q$ representing
greater similarity. The analogy with spin glasses would
assign an overlap $q^{(\alpha,\beta)}$ between the CAO distributions
for two sequences $\alpha$ and $\beta$ proportional to

$$\sum_{n=1}^{n_{\max}} \sum_{C=1}^{n_{\max}} P_n^{(\alpha)}(C)\, P_n^{(\beta)}(C), \qquad (3)$$

where $P_n^{(\alpha)}(C)$ is the probability that native contact $n$ is
made permanently at order $C$ in a folding trajectory of
sequence $\alpha$, and $n_{\max}$ is the total number of native contacts.
An accurate numerical estimate of the quantity in
Eq. (3), however, is problematic to obtain, requiring the
generation of an inordinate number of folding trajectories.
As an alternative, we define $q$ using a closely related
quantity,




where
$\langle C \rangle_n^{(\alpha)} = \sum_{C=1}^{n_{\max}} P_n^{(\alpha)}(C)\, C$
is the average CAO of contact $n$ for sequence $\alpha$ and
$(\sigma_n^{(\alpha)})^2 = \sum_{C=1}^{n_{\max}} P_n^{(\alpha)}(C)\, (C - \langle C \rangle_n^{(\alpha)})^2$
is its variance. Equations (3) and (4) are completely
equivalent in the case of Gaussian
distributed CAOs. Even for non-Gaussian statistics,
$q^{(\alpha,\beta)}$ remains a useful, computationally tractable,
and similarly bounded measure of how similarly two sequences
fold.
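
For orientation, the quantities entering these definitions can be computed directly from CAO histograms, as in the Python sketch below. Two caveats apply. First, the histograms are assumed to be stored as a dictionary mapping contact to a list of probabilities over $C = 1, \ldots, n_{\max}$, which is a layout chosen for this example. Second, the normalization by self-overlaps in `normalized_overlap` is an assumption introduced here so that identical histograms give a value of one; it illustrates the double sum of Eq. (3) and is not the mean-and-variance definition of $q$ adopted in this work.

```python
import math


def raw_overlap(P_a, P_b):
    """Double sum of Eq. (3): sum over contacts n and orders C of
    P_a[n][C] * P_b[n][C]."""
    return sum(
        pa * pb
        for contact in P_a
        for pa, pb in zip(P_a[contact], P_b[contact])
    )


def normalized_overlap(P_a, P_b):
    """Raw overlap divided by the geometric mean of the self-overlaps.
    This normalization is an assumption of the sketch."""
    return raw_overlap(P_a, P_b) / math.sqrt(
        raw_overlap(P_a, P_a) * raw_overlap(P_b, P_b)
    )


def mean_and_variance(p):
    """Average CAO and its variance for one contact, given p[C - 1]."""
    mean = sum(c * pc for c, pc in enumerate(p, start=1))
    var = sum(pc * (c - mean) ** 2 for c, pc in enumerate(p, start=1))
    return mean, var
```

Estimating `raw_overlap` reliably requires well-sampled histograms in every bin, whereas `mean_and_variance` needs only the first two moments of each histogram, which is the practical motivation for a definition based on means and variances.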

III. RESULTS AND DISCUSSION 

In the ensemble of sequences we generated, the fastest 
folding sequences access the native state more than 1000 
times more rapidly than the slowest. CAO histograms 
were generated for all sequences, each one evincing a well- 
defined topological pathway. Typically, the appearance 
order C of a given native contact varies from one trajec- 
tory to another by only a few positions (see below). This 
regularity belies substantial conformational fluctuations 
attending each folding event, which exert little influence 
on the formation of permanent contacts. Sharply peaked 
CAO histograms do not indicate a lack of complexity, but 
instead a successful characterization of forward progress 
along the reaction coordinate for folding. 

Figure 1 shows CAO histograms for several sequences 
folding to this specific 48-mer structure (depicted in Fig.
1a). Results are presented for dynamics propagated according
to both full and Go-like models. Comparing
these topological fingerprints across different sequences 
hints at the broad variety of possible folding pathways. 
Contacts essential to early stages of folding for one se- 
quence can be irrelevant in the pathway taken by another. 
This finding contrasts strongly with the "one-structure 
one-nucleus" hypothesis, bolstering recent reports of dissimilar
folding nuclei.

Strong variations in the topological folding pathways 
chosen from one sequence to another immediately indicate
that the original homogeneous Go model cannot
capture the folding behavior of a typical sequence.
With a homogeneous set of native contact energies, that 
model can only discriminate between different native 



structures, not between different sequences that adopt 
them. In loose terms folding dynamics of the homoge- 
neous Go model resemble a superposition of those we de- 
termined for diverse sequences of the full model. Whereas 
in the full model a typical set of contact energies selects 
a well-defined folding pathway, an egalitarian set of sta- 
bilizing energies permits broad sampling of routes to the 
native state. 

Go-like models that embrace variety in native contact 
energies, however, capture the topological pathways fol- 
lowed by their full model counterparts with striking ac- 
curacy. CAO histograms obtained from full and Go-like 
dynamics for any particular sequence can hardly be distinguished
(see Fig. 1). Not only are the average CAOs
of each contact nearly equivalent, but also fine details of 
CAO statistics are unaffected by neglect of non-native 
contact energies. While previous work hypothesized a 
dynamical correspondence for fast folders, the topologi- 
cal conformity of full and Go-like mechanisms we observe 
for slow folders is highly unexpected. 

For sequences with folding rates $\lesssim 10^{-9}$, we are unable
to harvest folding trajectories in sufficient numbers
to construct CAO histograms. According to microscopic 
reversibility, however, topological routes for folding are 
identical to time-reversed routes of unfolding. We have 
therefore extended our analysis of contact appearance or- 
der for efficiently folding sequences to one of contact dis- 
appearance order (CDO) for very sluggishly folding se- 
quences. The agreement between CDO histograms of 
full and Go-like models is no less striking than that of 
the CAO histograms plotted in Fig. 1, even in cases 
where the "native" state is grossly unstable. These calculations
are somewhat less straightforward: the order of
first disappearance (CDO) is equivalent to the order of 
permanent appearance (CAO), but only for trajectories 
reaching the unfolded state without revisiting the native 
state. As such, they require specifying when a molecule 
has unfolded. For this purpose, we regard a molecule as 
unfolded when the instantaneous number of native con- 
tacts drops to a value consistent with the average number 
of native contacts in the unfolded state. Additionally, we 
require that this threshold lie below any value found in 
equilibrium fluctuations of the native state. We have 
verified that CAO and CDO histograms indeed match 
for sequences folding at moderate rates. 
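
The correspondence between the two orderings can be illustrated with a short Python sketch. It assumes an unfolding trajectory stored as a list of frames (sets of native contacts present), starting in the native state and ending once the unfolding criterion is met without any return to the native state. The function names are hypothetical, and ties are broken by contact label as an arbitrary choice of this sketch.

```python
def contact_disappearance_order(frames, native):
    """Rank native contacts by the time at which they first break."""
    first_lost = {}
    for contact in native:
        lost = [t for t, frame in enumerate(frames) if contact not in frame]
        # Contacts that never break are ranked last.
        first_lost[contact] = lost[0] if lost else len(frames)
    ranked = sorted(native, key=lambda c: (first_lost[c], c))
    return {contact: rank for rank, contact in enumerate(ranked, start=1)}


def cdo_to_cao(cdo):
    """Map a disappearance order onto the appearance order of the
    time-reversed trajectory: the first contact to break is the last
    one to form permanently."""
    n_max = len(cdo)
    return {contact: n_max + 1 - rank for contact, rank in cdo.items()}
```

In the time-reversed trajectory, the first frame in which a contact is missing becomes the last frame in which it is missing, which is why first disappearance maps onto permanent appearance.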

Quantitative measures of mechanistic diversity are pre- 
sented in Fig. 2a. For each pair of sequences generated by 
our evolutionary simulation we computed the similarity 
parameter q between CAO histograms for the full model. 
The resulting distribution of q values is broadly peaked 
at $q \approx 0.4$, signifying that there is a significant diversity
of CAO pathways represented by the sequences in the en- 
semble. For each individual sequence we also quantified 
the relationship between CAO histograms generated by 
full and Go- like models. These q values are distributed 
much more narrowly about a considerably higher aver- 
age, q ~ 0.9. Using sequence-to-sequence variation in 
CAO pathways as a yardstick, the irrelevance of non- 
FIG. 3: (a) Number of native contacts as a function of time in a folding trajectory, illustrating the "prefolding" (blue) and 
"folding" (red) phases of the dynamics. The prefolding phase extends from the folding trajectory's start time until the time the 
first permanent native contact is formed. The folding phase extends from this time to the time when the native conformation 
is reached. The full (green) curve shows p_fold, which departs from zero only after the folding phase has started (cf. Fig. 2). 
(b, right panels) Distributions of the durations of the prefolding and folding phases, in the full potential and its corresponding 
Go-like approximation. For fast-folding sequences (top panel) the distributions for both folding and prefolding durations of the 
Go-like model are close to those of the full potential. For slow-folding sequences (middle panel) the Go-like model reproduces 
the distribution of folding duration, but underestimates the prefolding times. If the Go-like potential of slow-folding sequences 
is supplemented by random non-native contact energies (bottom panel) the prefolding distributions can be made to match, 
without disrupting the correspondence in the folding phase distributions. (c) Ratio between full and Go-like models' folding 
(red) and prefolding (blue) phase durations, for all sequences ordered according to folding rate; the full lines are the average 
ratios for each scatter plot. For fast folders, the average times as calculated from the full and Go-like models are comparable, 
both for the folding and prefolding phases. For slow folders, the prefolding time in the Go-like model is much smaller than that in 
the full potential, and this difference increases with decreasing folding rate. 



native contacts for the topological folding pathway is be- 
yond doubt. The inset to Fig. 2a emphasizes that this 
result has little to do with folding efficiency. Typical q 
values for the full/Go-like comparison are just as high for 
the slowest folders examined as for the fastest. 

Figure 2b quantifies the variation of CAO between 
folding trajectories. For each sequence we quantify the 
root mean-squared fluctuation in the contact order: 

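\[
\delta C = \sqrt{ \frac{1}{n_{\max}} \sum_{n=1}^{n_{\max}} \left( \langle C_n^2 \rangle - \langle C_n \rangle^2 \right) } . \qquad (5)
\]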

Fig. 2b shows the distribution of δC among the ensemble 
of Go-like sequences. It is peaked at a value of δC ≈ 7.5. 
In contrast, for the homogeneous Go model δC ≈ 12.5, 
indicating that CAO values are much more broadly distributed 
between trajectories (see inset to Fig. 2b). The 
homogeneous Go model indeed lacks the pathway specificity 
exhibited when contact energies are diverse, as in 
heterogeneous Go-like models. 
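
The fluctuation measure of Eq. (5) can be illustrated with a minimal sketch. It rests on one plausible reading of that equation, which is an assumption here: each trajectory is summarized by the rank at which every native contact became permanent, the variance of that rank is averaged over contacts, and the square root is taken. The function name and the input layout are hypothetical, not taken from the original study.

```python
import math


def rms_order_fluctuation(orders):
    """RMS trajectory-to-trajectory fluctuation of contact appearance order.

    orders[t][c] is the rank at which native contact c became permanent
    in trajectory t (assumed data layout, for illustration only).
    """
    n_traj = len(orders)
    n_contacts = len(orders[0])
    total_variance = 0.0
    for c in range(n_contacts):
        mean = sum(o[c] for o in orders) / n_traj
        mean_sq = sum(o[c] ** 2 for o in orders) / n_traj
        # Variance of the appearance rank of contact c across trajectories.
        total_variance += mean_sq - mean ** 2
    # Average the variance over contacts, then take the square root.
    return math.sqrt(total_variance / n_contacts)
```

In this sketch, identical trajectories give a value of zero, and trajectories that form contacts in widely varying order give large values, mirroring the contrast drawn above between heterogeneous and homogeneous Go models.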



The relevance of CAOs for the folding dynamics is illustrated 
in Fig. 2c. For two sequences and their Go-like 
approximations, it plots p_fold (Refs. 32, 36) as a function of the 
total number of permanent native contacts formed, averaged 
over 200 folding trajectories. p_fold gives the probability 
for trajectories initiated from a particular configuration 
to fold completely before visiting the unfolded 
state, and provides a standard basis for defining transition 
states in complex systems. Fig. 2c shows that 
p_fold ≪ 1 when the first permanent contact is formed. 
Since p_fold = 1 by definition when the last permanent 
contact is formed, CAO histograms chronicle nearly the 
entire course of folding dynamics, all the way from the 
unfolded basin of attraction (p_fold = 0) to the native state 
(p_fold = 1). 
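
The definition of the committor lends itself to a simple numerical illustration. The sketch below is not the lattice heteropolymer simulation. It assumes a toy one-dimensional unbiased random walk in which site 0 stands for the unfolded state and site `n_sites` for the native state, and it estimates p_fold by launching many short trajectories from a chosen starting point. All names and parameters are hypothetical.

```python
import random


def estimate_pfold(start, n_sites, n_shots=2000, rng=None):
    """Estimate the committor by direct shooting on a toy 1D random walk.

    Site 0 plays the role of the unfolded state and site n_sites the role
    of the native state (a toy assumption, not the lattice model).
    """
    rng = rng or random.Random(0)
    folded = 0
    for _ in range(n_shots):
        x = start
        # Propagate until one of the two absorbing states is reached.
        while 0 < x < n_sites:
            x += rng.choice((-1, 1))
        if x == n_sites:
            folded += 1
    # Fraction of trajectories that reached "native" before "unfolded".
    return folded / n_shots
```

For this unbiased walk the exact committor is `start / n_sites`, so a start at the midpoint should return a value close to 1/2, the usual criterion for a transition state.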

Insensitivity of topological folding pathways to non- 
native contact energies by no means implies a complete 
dynamical equivalence of full and Go-like models. For 
example, a sequence's mean first passage time for folding 
can differ by as much as three orders of magnitude 
FIG. 4: The CAO correspondence between the full potential and the Go-like approximation is robust to changes in chain length 
or target native structure. (Left) CAO maps for a 12-mer folding to the structure shown in the figure. (Center) A sequence 
of the 48-mer of Fig. 1 which has a secondary stable configuration. Each target structure defines a Go-like approximation 
from the set of its native contacts. Each Go-like model accurately predicts the CAO map for folding to the corresponding 
structure. (Right) Correspondence of Go-like/full CAO maps in a 64-mer. 



for full and Go-like models. This discrepancy is larger 
for sequences with slower folding rates. Such discrep- 
ancies may be due to the presence of off-pathway traps 
in the unfolded state, and possibly non-native stabilized 
intermediates along the folding pathway. However, our 
calculations suggest that such marked distinctions are 
largely limited to dynamics occurring before the value of 
the committor function p_fold increases significantly from 
zero, i.e. before significant progress has been made along 
the folding reaction coordinate. 

As illustrated in Fig. 3a, we can divide each folding tra- 
jectory into a period before any permanent contacts are 
made (the "pre-folding phase") and the remaining period 
in which lasting native structure develops (the "folding 
phase"). Note that this division takes place well before 
a molecule commits to the folded state (p_fold > 1/2); 
indeed, the number of non-native contacts at the begin- 
ning of the folding phase is typically comparable to that 
of the unfolded state. Fig. 3b shows the distributions 
of pre-folding and folding phases' durations for two se- 
quences representative of fast and slow folders. In both 
cases the influence of non-native contacts on the folding 
phase dynamics is weak. Non-native contacts mildly ex- 
tend the time required to complete folding after the first 



permanent contact is made, by less than an order of mag- 
nitude. By contrast, pre-folding dynamics of poorly de- 
signed sequences are quite sensitive to non-native contact 
energies. For the example shown in the middle panel of 
Fig. 3b, the waiting period prior to formation of a single 
permanent contact is roughly three orders of magnitude 
longer in the full model than in the Go-like model. No such 
dilation is observed for sequences that fold quickly in the 
full model. 
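
The division into pre-folding and folding phases can be sketched in a few lines. The sketch assumes that a trajectory is stored as a list of sets of contacts present at each time step, that it ends in the native state, and that a "permanent" contact is one that stays formed from some time until the end of the trajectory. The function name and data layout are hypothetical.

```python
def split_phases(contact_history, native_contacts):
    """Return (prefolding_duration, folding_duration) in time steps.

    contact_history[t] is the set of contacts present at step t; the
    trajectory is assumed to end in the native state.
    """
    n_steps = len(contact_history)
    # Native contacts that are present in every frame from t to the end.
    still_present = set(native_contacts)
    first_permanent_time = n_steps
    for t in range(n_steps - 1, -1, -1):
        still_present &= contact_history[t]
        if not still_present:
            break
        first_permanent_time = t
    if first_permanent_time == n_steps:
        raise ValueError("trajectory does not end with any native contact")
    prefolding = first_permanent_time
    folding = (n_steps - 1) - first_permanent_time
    return prefolding, folding
```

Walking backward from the final frame means that native contacts which form transiently and then break are ignored, so only lasting native structure marks the start of the folding phase.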

Because contact appearance order is a sensitive measure 
of approach to the dynamical bottleneck for folding, 
our division of pre-folding and folding phases is a 
kinetically meaningful one. Most importantly, p_fold ≪ 1 
throughout pre-folding dynamics, as seen in Fig. 3a, indicating 
that the system remains well within the unfolded 
basin of attraction. Only when permanent contacts 
are made does p_fold rise significantly, so that the 
folding phase entirely encompasses departure from the 
unfolded state and transit to the native structure. It is 
remarkable that non-native contacts, which can substan- 
tially prolong dwell times in the unfolded state, exert 
no discernible influence on the topological folding order, 
and only a small effect on the duration of folding phase 
dynamics. 
Our simulations suggest that progress toward the native 
state is essentially orthogonal to the formation and 
rupture of non-native contacts. A number of such con- 
tacts are certainly present over much of the course of fold- 
ing, but they do little to decide what conformational rear- 
rangements bring a chain closer to its transition state for 
folding. To further test this idea, we studied folding dy- 
namics governed by potential energy functions that com- 
bine aspects of full and Go-like models. Specifically, we 
selected a set of non-native contact energies at random 
from a Gaussian distribution; see Fig. 3b. The "frustrating" 
influence of these random energies matches precisely 
the behavior we have reported for the full model: 
CAO histograms are completely insensitive to the aver- 
age strength and variance of non-native attractions, while 
overall folding rates decrease with increasing non-native 
attraction strength. 
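
A hybrid potential of this kind can be sketched as follows. The sketch assumes that contact energies are stored per residue pair, that native pairs keep their Go-like energies, and that every other pair at least three residues apart along the chain (an illustrative cutoff) receives an energy drawn from a Gaussian. The function name, the dictionary layout, and the parameters `mean` and `sigma` are hypothetical.

```python
import random


def hybrid_contact_energies(n_res, native_energies, mean, sigma, seed=0):
    """Combine native (Go-like) energies with random non-native energies.

    native_energies maps residue pairs (i, j), i < j, to contact energies.
    All other pairs with j - i >= 3 get a Gaussian random energy.
    """
    rng = random.Random(seed)
    energies = {}
    for i in range(n_res):
        for j in range(i + 3, n_res):
            pair = (i, j)
            if pair in native_energies:
                # Native contacts keep the energy of the Go-like model.
                energies[pair] = native_energies[pair]
            else:
                # Non-native contacts are drawn at random.
                energies[pair] = rng.gauss(mean, sigma)
    return energies
```

Varying `mean` and `sigma` in such a construction corresponds to changing the average strength and variance of non-native attractions while leaving the native energies untouched.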

The observation of correspondence between dynam- 
ics of the full lattice model and that of a heterogeneous 
Go-like approximation does not noticeably depend upon 
chain length or on details of native structure. We have 
generated sequences with a range of folding rates for sev- 
eral native conformations of chains with lengths 8, 12, 48, 
and 64. For the two shortest chains, we used each max- 
imally compact lattice structure as a folded state. For 
the two longest chains, we studied several native structures 
varying significantly in compactness and in contact 
order. Typical results shown in Fig. 4 highlight that 
the fidelity of Go-like folding mechanisms is a very gen- 
eral feature of these lattice heteropolymers. 



IV. CONCLUSIONS 

Several arguments have been presented in the liter- 
ature to justify the use of Go models in studying the 
folding mechanisms of real proteins. Most commonly asserted 
(based on the principle of minimum frustration) 
is that evolutionary optimization of real sequences removes 
kinetic barriers and renders the energy landscape 
smoothly funneled and therefore Go-like. Biases due 
to topological features of the native state, unchanged in a 
protein's Go-like representation, have also been invoked to 
justify mechanistic fidelity. Our results demonstrate, 
however, that neither of these assumptions need hold for 
a Go-like model to reproduce in fine detail the topological 
ordering of folding events of a lattice heteropolymer. 

Robustness of the detailed mechanism for folding to 
omission of non-native contacts is not a consequence of 
sequence design within the schematic lattice models we 
have studied. It is a fundamental emergent feature of 
their statistical dynamics, independent of folding effi- 
ciency over the entire range accessible to our numerical 
simulations. Rather than introducing kinetic roadblocks 
that reshape transition states for folding, energetic diver- 
sions due to non-native contacts appear to strongly affect 
only physical properties of the unfolded state. Even the 
duration of trajectory segments that span folded and un- 



folded states is essentially determined by native energies 
alone, despite the fact that substantial non-native struc- 
ture must be disrupted en route. 

Lattice heteropolymers are perhaps the crudest rep- 
resentation of protein mechanics to which our analysis 
could be meaningfully applied. The correspondence be- 
tween full and Go-like folding mechanisms we have re- 
vealed might break down in more detailed models. For 
example, it has been reported that lattice heteropoly- 
mers do not exhibit glassy folding dynamics even at very 
low temperatures, while non-Arrhenius temperature de- 
pendence naturally arises in slightly elaborated models 
that describe side chain packing in addition to backbone 
conformation. Go-like energetics could alter folding 
pathways by abating the frustration underlying such 
glassy relaxation. This possibility, which merits further 
investigation, does not however negate the significance 
of our findings. Our primary purpose is not to justify 
the use of Go-like models for detailed study of real pro- 
teins' folding mechanisms. It is instead to establish the 
influence of non-native interactions on dynamics intrinsic 
to the fundamental interplay between chain connectivity 
and heterogeneous contact interactions. That interplay, 
whose understanding is central to any instructive physi- 
cal picture of protein folding, is not just present in simple 
lattice models - it is the exclusive source of their com- 
plexity. The results we have presented therefore establish 
an important point: Mechanistic aspects of protein fold- 
ing that arise from the basic physics of heteropolymer 
freezing are remarkably insensitive to non-native struc- 
ture. 
V. APPENDIX 

Our method of sequence generation, which effects a bi- 
ased random walk in the space of all possible sequences, 
is an extension of the method of Mirny et al. To generate 
ensembles of sequences folding to a specific native 
structure, we introduce random point mutations and accept 
them with a Metropolis probability 
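
\[
\min\left[ 1, \exp\left( -\frac{\Delta F^{(a')} - \Delta F^{(a)}}{k_B T_{\rm ev}} \right) \right] \qquad (6)
\]
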
that generates a Boltzmann-like distribution. Here, 
ΔF^(a) is an estimated activation free energy for folding 
of sequence a, k^(a) = k_0 exp(−ΔF^(a)/k_B T). We 
estimate the folding rate constant k^(a) for sequence a, 
relative to the rate of basic microscopic motions k_0, by 
computing the fraction of trajectories ⟨h_fold⟩_τ ≈ 1 − 
exp(−k^(a) τ) that fold within a fixed amount of time τ 
(with k^(a) ≪ τ^(-1) ≪ k_0). This strategy offers two distinct 
advantages: (1) the evolutionary temperature T_ev, 
which governs the stringency of selection for efficient folding, 
can be controlled systematically; and (2) estimates 
of folding efficiency via ⟨h_fold⟩_τ can converge much more 
rapidly than the mean first passage time calculations employed 
in Mirny et al. 
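
The selection step described above can be sketched as follows. The sketch assumes that the fraction of trajectories folding within time `tau` has already been measured for each sequence, inverts the relation between that fraction and the rate constant to obtain an activation free energy in units of k_B T, and applies a standard Metropolis test. The function names, the default `k0 = 1.0`, and the argument `kT_ev` are hypothetical, and free energies and `kT_ev` are assumed to share the same units.

```python
import math
import random


def activation_free_energy(frac_folded, tau, k0=1.0):
    """Estimate Delta F / (k_B T) from the fraction folded within tau.

    Uses frac = 1 - exp(-k tau) and k = k0 exp(-Delta F / k_B T);
    frac_folded must lie strictly between 0 and 1.
    """
    k = -math.log(1.0 - frac_folded) / tau
    return -math.log(k / k0)


def accept_mutation(dF_old, dF_new, kT_ev, rng=random):
    """Metropolis test: always accept a lower barrier, otherwise accept
    with probability exp(-(dF_new - dF_old) / kT_ev)."""
    if dF_new <= dF_old:
        return True
    return rng.random() < math.exp(-(dF_new - dF_old) / kT_ev)
```

A low `kT_ev` makes selection stringent, so mutations that raise the estimated barrier are almost never kept, while a high value lets the walk through sequence space explore slower folders.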

Our evolutionary simulations, conducted at moderate 
"temperature" T_ev = 0.05/k_B, demonstrate that 
in fact many folding pathways can provide efficient ac- 
cess to a single native state. It is therefore not at all 
self-evident that a particular, well-designed amino acid 
sequence should arrive at its native structure via simi- 
lar routes in full and Go-like versions of the lattice het- 
eropolymer model. 

Using this method, we have generated hundreds of 
thousands of sequences which fold to given structures 
(for example that of Fig. 1a) through a variety of folding 
mechanisms. This is the ensemble of sequences we use in 
this paper. Further details of the evolutionary dynamics 
used to generate these large ensembles of sequences will 
be given in a forthcoming publication. 